Conversation
soaxelbrooke left a comment:
I'd prefer to avoid changing the default tokenization behavior. I can see adding tok as an optional alternative in the Encoder constructor, though.
# [26, 108, 79, 104, 72, 24, 26, 117, 24, 9, 11, 8, 12, 10, 26, 90, 24, 26, 154, 56, 37, 149, 80, 169, 84, 24, 26, 156, 24]
# [25, 102, 76, 77, 68, 24, 25, 149, 24, 13, 10, 11, 12, 25, 79, 24, 25, 135, 58, 37, 152, 81, 160, 108, 24, 25, 143, 24]
print(next(encoder.inverse_transform(encoder.transform([example]))))
# vizzini : he didn ' t fall ? inconceivable !
So using tok would change the default tokenization?

For context, in NLP tasks it can be important to retain the usage of contractions, since they can be informative about other aspects of the text's author.

@soaxelbrooke I personally think the potential benefit of retaining contraction information is more than compensated for by the increased generalization power of having everything normalized :)! i.e. I'm open to further discussion. Meanwhile, how would you propose adding it as an option without also having the

@kootenpv I'm fine with people having different preferences on how they'd like those split up; my biggest concern here is changing the interface, which is a breaking change that would introduce bugs into dependent code. I haven't heard any complaints about the presence of NLTK, so I don't see any particular need to remove it. We could expose
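One backward-compatible option discussed above is accepting the tokenizer as an optional constructor argument, falling back to the current default when none is given. The following is a hypothetical sketch only: the Encoder constructor signature and the default_tokenizer helper are assumptions for illustration, not the library's actual API.

```python
from typing import Callable, List, Optional

def default_tokenizer(text: str) -> List[str]:
    # Stand-in for the existing default (NLTK-based) word tokenizer;
    # the real implementation differs.
    return text.lower().split()

class Encoder:
    """Sketch of exposing the word tokenizer as an optional argument
    without changing the default behavior for existing callers."""

    def __init__(self, vocab_size: int = 8192,
                 word_tokenizer: Optional[Callable[[str], List[str]]] = None):
        self.vocab_size = vocab_size
        # Fall back to the existing default when no tokenizer is supplied,
        # so dependent code sees no interface change.
        self.word_tokenizer = word_tokenizer or default_tokenizer

    def tokenize(self, text: str) -> List[str]:
        return self.word_tokenizer(text)
```

Callers who want contraction-preserving behavior could then pass their own callable (e.g. a thin wrapper around tok) while everyone else keeps the current default.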
The tokenizer makes for a better default (faster and saner).
Targets #10